Overall Goal

  • Write functions that will manipulate and process data sets that come in a certain form
  • Create a generic function to automatically plot the returned data

We will be using .csv files from the Census Bureau. The first two data sets contain information about public school enrollment, while the last four contain resident population estimates.

We will first read in and parse the data through separate tasks, then repeat the same steps on a similar data set while converting each task into a function, so the whole process can be repeated through one function call. We will then call those functions to read in and parse the enrollment data sets, and create a new function to combine them. From the combined data, we will write generic functions for summarizing and plotting based on the class of the object. We will conclude by putting all the newly created functions together and practicing on the resident population .csv files: reading in the data, processing it, combining it, summarizing it, and lastly plotting it.

Data Processing

First Data Set

The first thing we are going to do is read in the first data set, which we will analyze and later use to build our functions.

library(readr)
library(tidyverse)

sheet1 <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/EDU01a.csv")

Now we are going to filter out the duplicate "District of Columbia" row and keep only the following columns to trim the data we are analyzing:
* Area_name (which we will rename as area_name)
* STCOU
* Any column that ends in “D”

sheet1 <- sheet1 %>%
  filter(Area_name != "District of Columbia") %>%
  dplyr::select(Area_name, STCOU, ends_with("D")) %>%
  rename(area_name = Area_name)

sheet1

We can see how our data is structured: the overall country statistics come first, then state-level data, and lastly county-level data.

Let's convert the data into a long format using the pivot_longer() function, so that each row has the area_name with only one enrollment value.

sheet1_long <- sheet1 %>%
  pivot_longer(
    cols = ends_with("D"),
    names_to = "education_measurement",
    values_to = "enrollment")

sheet1_long

We have now changed the data into a longer format where each row contains a single education measurement, with its corresponding value stored in the next column.

Now we are going to parse the education_measurement column: the type of survey is the first three letters followed by the next four digits, and the two-digit year sits in the third-to-last and second-to-last positions of the string. We will use the substr() function to do this. Lastly, we will use dplyr::select() with everything() to reorder the columns and make them easier to follow. This ordering is our personal preference.

# Now get the year from the education_measurement column
sheet1_long$year <- substr(sheet1_long$education_measurement, 
                                 start = nchar(sheet1_long$education_measurement)-2, 
                                 stop = nchar(sheet1_long$education_measurement)-1)

sheet1_long$year <- as.numeric(sheet1_long$year)
# This allows us to make sure the survey was done in 1900s or 2000s
sheet1_long$year <- ifelse(sheet1_long$year > 25, 1900 + sheet1_long$year, 2000 + sheet1_long$year)


# Now get the first 3 letters and next 4 digits to see what type of survey was taken
sheet1_long$education_measurement <- substr(sheet1_long$education_measurement, 
                                            start = 1, 
                                            stop = 7)

# Order the data to get the year in front of the measurement
sheet1_long <- sheet1_long %>%
  dplyr::select(area_name, STCOU, year, everything())

# Display this new data
sheet1_long
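As a quick sanity check, here is the same scheme applied to a single made-up column name (the value "EDU010187D" is illustrative, not pulled from the data):

```r
# Made-up column name: survey code "EDU0101", two-digit year "87", trailing "D"
measurement <- "EDU010187D"

# Survey type: the first three letters plus the next four digits
survey <- substr(measurement, start = 1, stop = 7)

# Year: the third-to-last and second-to-last characters, then pick the century
yr <- as.numeric(substr(measurement, start = nchar(measurement) - 2,
                        stop = nchar(measurement) - 1))
yr <- ifelse(yr > 25, 1900 + yr, 2000 + yr)

survey  # "EDU0101"
yr      # 1987
```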

We can see that our data has the year and the type of survey (or measurement of education) that was taken for corresponding areas. We want to separate the data into two different tibbles: one at the county level, which will be called sheet1_county_data, and one that is not county-level data, which will be called sheet1_not_county_data. We will use the filter() and grepl() functions to find the pattern "County Name, XX" (two trailing letters after a comma); rows matching that pattern become part of sheet1_county_data, and rows that do not become part of sheet1_not_county_data. We are also going to prepend a class called "county" to sheet1_county_data and a class called "state" to sheet1_not_county_data.

# Here we are going to make the county level data
sheet1_county_data <- sheet1_long %>% 
  filter(grepl(pattern = ", \\w\\w", area_name))
# Create the new class called county
class(sheet1_county_data) <- c("county", class(sheet1_county_data))
sheet1_county_data
# Here we are going to make the non-county level data
sheet1_not_county_data <- sheet1_long %>% 
  filter(!grepl(pattern = ", \\w\\w", area_name))
# Create the new class called state
class(sheet1_not_county_data) <- c("state", class(sheet1_not_county_data))
sheet1_not_county_data
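To see what the pattern ", \\w\\w" is doing, here it is applied to a few toy area names (made up for illustration):

```r
# County rows look like "Name, XX"; state and national rows do not
areas <- c("Autauga, AL", "ALABAMA", "UNITED STATES")
grepl(pattern = ", \\w\\w", areas)   # TRUE FALSE FALSE
```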

After seeing how our two datasets look, we are going to add to the sheet1_county_data tibble by creating a new variable called state which shows what state each county is in. We are going to use the substr() function to do this. Lastly using dplyr::select() and everything() functions, we are going to reorder the columns for them to make sense.

# Get the state from each county and make it a new column
sheet1_county_data$state <- substr(sheet1_county_data$area_name,
                                   start = nchar(sheet1_county_data$area_name)-1,
                                   stop = nchar(sheet1_county_data$area_name))

# Reorder the columns with county and then state column following it
sheet1_county_data <- sheet1_county_data %>% 
  dplyr::select(area_name, state, everything())

# Show the updated tibble
sheet1_county_data

Having added the state to the sheet1_county_data tibble, we now want to add a variable called division to sheet1_not_county_data, based on the divisions defined by the Census Bureau. This new variable will be created using the %in% operator. If an area_name does not correspond to a US state, we return "ERROR" so we know it is not one. Lastly, using the dplyr::select() and everything() functions, we reorder the columns so they make sense.
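A minimal sketch of the %in% idea, using a shortened division list (only two states included here for illustration):

```r
# Tiny stand-in for one of the division vectors built below
new_england <- toupper(c("Connecticut", "Maine"))
"MAINE" %in% new_england          # TRUE
"UNITED STATES" %in% new_england  # FALSE
# The final "else" branch of the nested ifelse() gives non-states "ERROR"
ifelse("UNITED STATES" %in% new_england, "New England", "ERROR")  # "ERROR"
```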

# Make all the divisions based on classification
division_1 <- toupper(c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont"))
division_2 <- toupper(c("New Jersey", "New York", "Pennsylvania"))
division_3 <- toupper(c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin"))
division_4 <- toupper(c("Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota"))
division_5 <- toupper(c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia",
                        "District of Columbia", "West Virginia"))
division_6 <- toupper(c("Alabama", "Kentucky", "Mississippi", "Tennessee"))
division_7 <- toupper(c("Arkansas", "Louisiana", "Oklahoma", "Texas"))
division_8 <- toupper(c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming"))
division_9 <- toupper(c("Alaska", "California", "Hawaii", "Oregon", "Washington"))

# Create a new column called division to put the states in the correct spot. We will then sort the data so the division is next to the state
sheet1_not_county_data <- sheet1_not_county_data %>%
  mutate(
    division = ifelse(
      area_name %in% division_1, "New England", ifelse(
        area_name %in% division_2, "Middle Atlantic", ifelse(
          area_name %in% division_3, "East North Central", ifelse(
            area_name %in% division_4, "West North Central", ifelse(
              area_name %in% division_5, "South Atlantic", ifelse(
                area_name %in% division_6, "East South Central", ifelse(
                  area_name %in% division_7, "West South Central", ifelse(
                    area_name %in% division_8, "Mountain", ifelse(
                      area_name %in% division_9, "Pacific", "ERROR"
                    )
                  )
                )
              )
            )
          )
        )
      )
    )
  ) %>% 
  dplyr::select(area_name, division, everything())

# Print the updated tibble
sheet1_not_county_data

Second Data Set

Now we want to repeat this process using functions that we are going to create. Before we do that, we will read in the next section of data that we are going to analyze and use to build those functions. The code is the same as in the previous section.

sheet2 <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/EDU01b.csv")

We are going to start making functions that do everything we previously did for our first data set. The first function, called reshaping_data(), will dplyr::select() the columns we need and then pivot the data into the longer format, as was done in the previous section. We will also remove the duplicate row for the "District of Columbia".

reshaping_data <- function(tibble, value_column = "enrollment") {
  # Here we are selecting and renaming our desired columns
  tibble <- tibble %>% 
    filter(Area_name != "District of Columbia") %>%
    dplyr::select(Area_name, STCOU, ends_with("D")) %>% 
    rename(area_name = Area_name)
  
  # Here we are changing our data into the long format
  tibble_long <- tibble %>%
  pivot_longer(
    cols = ends_with("D"),
    names_to = "education_measurement",
    values_to = value_column)
  # Now we are going to return the long formatted data
  return(tibble_long)
}

We want to create another function, which will be called find_year_and_survey(). This function parses the education_measurement column: the survey type is the first three letters followed by the next four digits, and the two-digit year sits in the third-to-last and second-to-last positions of the string. We will use the substr() function within our new function to do this. Lastly, we use dplyr::select() and everything() within the function to reorder the columns, as we did before with the first data set. This ordering is our personal preference.

find_year_and_survey <- function(tibble_long) {
  # Now get the year from the education_measurement column
  tibble_long$year <- substr(tibble_long$education_measurement, 
                              start = nchar(tibble_long$education_measurement)-2, 
                              stop = nchar(tibble_long$education_measurement)-1)
  tibble_long$year <- as.numeric(tibble_long$year)
  # This allows us to make sure the survey was done in 1900s or 2000s  
  tibble_long$year <- ifelse(tibble_long$year > 25, 1900 + tibble_long$year, 2000 + tibble_long$year)

  # Now get the first 3 letters and next 4 digits to see what type of survey was taken
  tibble_long$education_measurement <- substr(tibble_long$education_measurement, 
                                              start = 1, 
                                              stop = 7)

  # Order the data to get the year in front of the measurement
  tibble_long <- tibble_long %>%
    dplyr::select(area_name, STCOU, year, everything())
  
  # Return this new data
  return(tibble_long)
}

For the county-level tibble, we are going to create a function called find_state(), which uses substr() to create a new variable called state describing which state each county measurement corresponds to. We will also reorder the columns to put the new state variable next to the area_name (the county) it belongs to.

find_state <- function(tibble_county_data) {
  # Get the state from each county and make it a new column
  tibble_county_data$state <- substr(tibble_county_data$area_name, 
                                     start = nchar(tibble_county_data$area_name)-1, 
                                     stop = nchar(tibble_county_data$area_name))

  # Reorder the columns with county and then state column following it
  tibble_county_data <- tibble_county_data %>% 
    dplyr::select(area_name, state, everything())
  
  # Show the updated tibble
  return(tibble_county_data)
}

Now for the non-county-level tibble, we are going to create a function called find_division(), which will create a new variable called division based on the US divisions defined by the Census Bureau. This new variable will be created using the %in% operator. If an area_name does not correspond to a US state, we return "ERROR" so we know it is not one. Lastly, using the dplyr::select() and everything() functions within our function, we reorder the columns so they make sense.

find_division <- function(tibble_non_county_data) {
  # Make all the divisions based on classification
  division_1 <- toupper(c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont"))
  division_2 <- toupper(c("New Jersey", "New York", "Pennsylvania"))
  division_3 <- toupper(c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin"))
  division_4 <- toupper(c("Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota"))
  division_5 <- toupper(c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", 
                          "District of Columbia", "West Virginia"))
  division_6 <- toupper(c("Alabama", "Kentucky", "Mississippi", "Tennessee"))
  division_7 <- toupper(c("Arkansas", "Louisiana", "Oklahoma", "Texas"))
  division_8 <- toupper(c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming"))
  division_9 <- toupper(c("Alaska", "California", "Hawaii", "Oregon", "Washington"))

  # Create a new column called division to put the states in the correct spot. We will then sort the data so the division is next to the state
  tibble_non_county_data <- tibble_non_county_data %>%
    mutate(division = ifelse(
      area_name %in% division_1, "New England", ifelse(
        area_name %in% division_2, "Middle Atlantic", ifelse(
          area_name %in% division_3, "East North Central", ifelse(
            area_name %in% division_4, "West North Central", ifelse(
              area_name %in% division_5, "South Atlantic", ifelse(
                area_name %in% division_6, "East South Central", ifelse(
                  area_name %in% division_7, "West South Central", ifelse(
                    area_name %in% division_8, "Mountain", ifelse(
                      area_name %in% division_9, "Pacific", "ERROR"
                      )
                    )
                  )
                )
              )
            )
          )
        )
      )
    ) %>% 
    dplyr::select(area_name, division, everything())
  # Return the updated tibble
  return(tibble_non_county_data)
}

Our next function will allow us to split the data into two different tibbles, one based on county-level data and the other by non-county-level data, similar to what we did with the first data source. We are going to call this function split_into_level_type(). We are also going to combine our last two created functions with this one to return the two correctly structured data sets we desire.

split_into_level_type <- function(tibble_long) {
  # Here we are going to split our data into county level and non-county level data
  # First split into county level data
  tibble_county_data <- tibble_long %>% 
    filter(grepl(pattern = ", \\w\\w", area_name))
  # Create the new class called county
  class(tibble_county_data) <- c("county", class(tibble_county_data)) 

  # Here we are going to make the non-county level data
  tibble_non_county_data <- tibble_long %>% 
    filter(!grepl(pattern = ", \\w\\w", area_name))
   # Create the new class called state
  class(tibble_non_county_data) <- c("state", class(tibble_non_county_data))
  
  # Now combine our county-level function from before to get the state
  tibble_county_data_updated <- find_state(tibble_county_data)
  
  # Now combine our non-county-level function from before to get the division
  tibble_non_county_data_updated <- find_division(tibble_non_county_data)
  
  # Return the two tibbles
  return(list(county = tibble_county_data_updated, state = tibble_non_county_data_updated))
}

Now the last thing we are going to do is put all the functions into one function call. This will allow us to call a single function, given a URL for our data, to read in and fully process it. We are going to call this function wrapping_data_processing_functions().

wrapping_data_processing_functions <- function(url, var_name = "enrollment") {
  # Read in the data through the URL
  tibble <- read_csv(url)
  
  # Now reshape the tibble with how we want it to look with columns selected and removing the duplicate row
  tibble_long <- reshaping_data(tibble, value_column = var_name)
  
  # Now we want to find the year and survey type of the data
  tibble_long_updated <- find_year_and_survey(tibble_long)
  
  # Lastly we want to split the data into county and non-county (state) level data and include necessary information
  tibble_education_data_list <- split_into_level_type(tibble_long_updated)
  
  # Return the final list of two tibbles
  return(tibble_education_data_list)
}

Call it and Combine Our Data

Call wrapping_data_processing_functions() twice to read in and parse the two .csv files we have previously worked with, and store the results in objects to be combined later.

test1 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/EDU01a.csv",
                                               var_name = "enrollment")
test2 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/EDU01b.csv",
                                               var_name = "enrollment")

Write a short function called combine_census() that takes the results of the two calls to wrapping_data_processing_functions() and combines the tibbles. That is, the two county-level data sets get combined and the two non-county-level data sets get combined.

combine_census <- function(wrapper1, wrapper2) {
  # Use dplyr::bind_rows() to combine the tibbles, indexing the list returned
  # by the wrapper function for the county element
  county_data_merge <- dplyr::bind_rows(wrapper1[["county"]], wrapper2[["county"]])
  # Index the list for the named element state
  non_county_data_merge <- dplyr::bind_rows(wrapper1[["state"]], wrapper2[["state"]])
  # Return the two data sets as one object using the list function
  return(list(county = county_data_merge, state = non_county_data_merge))
}

#call the combine_census function to combine the result of the two calls to the wrapper function
test3 <- combine_census(test1, test2)
test3
## $county
## # A tibble: 62,900 × 6
##    area_name   state STCOU  year education_measurement enrollment
##    <chr>       <chr> <chr> <dbl> <chr>                      <dbl>
##  1 Autauga, AL AL    01001  1987 EDU0101                     6829
##  2 Autauga, AL AL    01001  1988 EDU0101                     6900
##  3 Autauga, AL AL    01001  1989 EDU0101                     6920
##  4 Autauga, AL AL    01001  1990 EDU0101                     6847
##  5 Autauga, AL AL    01001  1991 EDU0101                     7008
##  6 Autauga, AL AL    01001  1992 EDU0101                     7137
##  7 Autauga, AL AL    01001  1993 EDU0101                     7152
##  8 Autauga, AL AL    01001  1994 EDU0101                     7381
##  9 Autauga, AL AL    01001  1995 EDU0101                     7568
## 10 Autauga, AL AL    01001  1996 EDU0101                     7834
## # ℹ 62,890 more rows
## 
## $state
## # A tibble: 1,040 × 6
##    area_name     division STCOU  year education_measurement enrollment
##    <chr>         <chr>    <chr> <dbl> <chr>                      <dbl>
##  1 UNITED STATES ERROR    00000  1987 EDU0101                 40024299
##  2 UNITED STATES ERROR    00000  1988 EDU0101                 39967624
##  3 UNITED STATES ERROR    00000  1989 EDU0101                 40317775
##  4 UNITED STATES ERROR    00000  1990 EDU0101                 40737600
##  5 UNITED STATES ERROR    00000  1991 EDU0101                 41385442
##  6 UNITED STATES ERROR    00000  1992 EDU0101                 42088151
##  7 UNITED STATES ERROR    00000  1993 EDU0101                 42724710
##  8 UNITED STATES ERROR    00000  1994 EDU0101                 43369917
##  9 UNITED STATES ERROR    00000  1995 EDU0101                 43993459
## 10 UNITED STATES ERROR    00000  1996 EDU0101                 44715737
## # ℹ 1,030 more rows

Writing Generic Functions for Summarizing and Plotting

Since we have our own classes for county and state, we can write custom plot functions for these by adding methods to the generic plot() function. For the state plotting method, we want to write a function that plots the mean value of the enrollment statistic across the years for each division.
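The dispatch mechanism itself can be sketched with a toy class (the class name "demo" here is made up):

```r
# A plot method for objects of (made-up) class "demo"
plot.demo <- function(x, ...) "demo method called"

# Prepending "demo" to the class vector makes plot() dispatch to plot.demo()
obj <- structure(list(), class = c("demo", "list"))
plot(obj)   # "demo method called"
```

This is exactly why we prepended "county" and "state" to the class vectors earlier: calling plot() on those tibbles will dispatch to the methods we define below.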

# Add the .state method to the plot function, with an argument for which
# variable we want to calculate the mean of
plot.state <- function(df, var_name = "enrollment", ...) {
  new_df <- df %>%
    # Since we want the mean for each year and division, group by these variables
    group_by(division, year) %>%
    # Use summarise() to calculate the mean, and get() to reference var_name
    summarise(avg_enrollment = mean(get(var_name))) %>%
    # Remove the ERROR setting of the division variable
    filter(division != "ERROR")
  # Use ggplot to plot the numeric year on the x-axis and the mean of the
  # statistic for each division and year on the y-axis
  ggplot(new_df, aes(x = year, y = avg_enrollment, color = division)) + 
    geom_line()
}

For the county class, we want a similar plotting function with more flexibility. This function should allow the user to specify the state of interest, whether the top or bottom counties should be looked at, and how many of them to investigate.

# Default arguments: enrollment, AL, top, and 5
plot.county <- function(df, var_name = "enrollment", filter_state = "AL",
                        filter_county = "top", filter_value = 5, ...) {
  # Return 1 if "top" is selected and 0 for "bottom", using base R's ifelse();
  # stop() ends the call if neither is given. This multiplier is used below to
  # choose descending or ascending order when sorting the summarized statistic
  filter_multiplier <- ifelse(filter_county == "top", 1,
                              ifelse(filter_county == "bottom", 0,
                                     stop("Must select either 'top' or 'bottom' as options for filter_county; default is 'top'")))
  new_df <- df %>%
    # Filter the county data to only include data from the specified state
    filter(state == filter_state) %>%
    # Group by the area name
    group_by(area_name) %>%
    # Find the overall mean of the specified variable, again using get()
    summarise(avg_enrollment = mean(get(var_name))) %>%
    # Sort largest to smallest for "top" (multiplier 1) or smallest to largest
    # for "bottom" (multiplier 0): multiplying the (positive) statistic by the
    # multiplier makes the condition TRUE for "top", so desc() is applied, and
    # FALSE for "bottom", keeping arrange()'s default ascending order
    arrange(if_else((avg_enrollment * filter_multiplier) > 0, desc(avg_enrollment), avg_enrollment)) %>%
    # Keep the top or bottom filter_value area_names, indexing from 1 to the
    # value given by the user (or the default)
    slice(1:filter_value)
  # Filter the data to only include the specified state and the area_names
  # found in new_df above
  filtered_df <- df %>%
    filter((state == filter_state) & (area_name %in% new_df$area_name))
  # Use ggplot to plot the numeric year on the x-axis and the var_name variable
  # on the y-axis, with a different color for each area_name
  ggplot(filtered_df, aes(x = year, y = get(var_name), color = area_name)) + 
    geom_line()
}
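The 0/1 multiplier trick can be seen in a small base-R sketch (toy values; it relies on the statistic being strictly positive, which holds for enrollment counts):

```r
avg <- c(10, 30, 20)
# "top": multiplier 1 makes the condition TRUE, so we sort on -avg (descending)
avg[order(ifelse(avg * 1 > 0, -avg, avg))]   # 30 20 10
# "bottom": multiplier 0 makes the condition FALSE, so we sort ascending
avg[order(ifelse(avg * 0 > 0, -avg, avg))]   # 10 20 30
```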

Put it Together

Here we will combine all the newly created functions to do all of the above in a few function calls. First we will run our data processing function (wrapping_data_processing_functions) on the two enrollment URLs used previously, specifying the name for the enrollment column. We will then use our combining function (combine_census) to put these into one object with two data frames.

wrapper1 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/EDU01a.csv",
                                               var_name = "enrollment")
wrapper2 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/EDU01b.csv",
                                               var_name = "enrollment")
data_combine <- combine_census(wrapper1, wrapper2)

Use the plot function on the state data frame

#index using the named second list element, "state"
plot(data_combine[["state"]])

Use the plot function on the county data frame using NC as the state, top being the group, and 10 as the number looked at

#index using the named second list element, "county"
plot(data_combine[["county"]], filter_state = "NC", filter_county = "top", filter_value = 10)

Use the plot function on the county data frame using AZ as the state, bottom being the group, and 6 as the number looked at

plot(data_combine[["county"]], filter_state = "AZ", filter_county = "bottom", filter_value = 6)

Use the plot function on the county data frame using only the default values for the arguments

plot(data_combine[["county"]])

Use the plot function on the county data frame using OH as the state, top being the group, and 8 as the number looked at

plot(data_combine[["county"]], filter_state = "OH", filter_county = "top", filter_value = 8)

Let's read in and combine several more data sets using the wrapping_data_processing_functions and combine_census functions, applying them to the four resident population data sets at the URLs below. Note that we keep the column name "enrollment" (even though these values are population estimates) so the defaults in our plotting functions still apply.

wrapper3 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/PST01a.csv",
                                               var_name = "enrollment")
wrapper4 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/PST01b.csv",
                                               var_name = "enrollment")
wrapper5 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/PST01c.csv",
                                               var_name = "enrollment")
wrapper6 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/PST01d.csv",
                                               var_name = "enrollment")

data_combine2 <- combine_census(wrapper3, wrapper4)
data_combine3 <- combine_census(data_combine2, wrapper5)
data_combine4 <- combine_census(data_combine3, wrapper6)

Use the plot function on the state data frame

plot(data_combine4[["state"]])

Use the plot function on the county data frame using PA as the state, top being the group, and 5 as the number looked at

plot(data_combine4[["county"]], filter_state = "PA", filter_county = "top", filter_value = 5)

Use the plot function on the county data frame using TX as the state, bottom being the group, and 12 as the number looked at

plot(data_combine4[["county"]], filter_state = "TX", filter_county = "bottom", filter_value = 12)

Use the plot function on the county data frame using only the default values for the arguments

plot(data_combine4[["county"]])

Use the plot function on the county data frame using NY as the state, top being the group, and 6 as the number looked at

plot(data_combine4[["county"]], filter_state = "NY", filter_county = "top", filter_value = 6)